Abstract

Let $T = t_0 \ldots t_{n-1}$ be a text and $P = p_0 \ldots p_{m-1}$ a pattern taken from some finite
alphabet set $\Sigma$, and let $d$ be a metric on $\Sigma$. We consider the problem of calculating the
sum of distances between the symbols of $P$ and the symbols of substrings of $T$ of length
$m$ for all possible offsets. We present an $\varepsilon$-approximation algorithm for this problem
which runs in time $O(\frac{1}{\varepsilon^2} n \cdot \mathrm{polylog}(n, |\Sigma|))$.

1 Introduction

String matching, the problem of finding all occurrences of a given pattern in a given text,
is a classical problem in computer science. The problem has pleasing theoretical features
and a number of direct applications to "real world" problems.

Advances in multimedia, digital libraries, and computational biology have shown that a
much more generalized theoretical basis of string matching could be of tremendous
benefit [?, ?]. To this end, string matching has had to adapt itself to increasingly broader
definitions of "matching". Two types of problems need to be addressed: generalized
matching and approximate matching. In generalized matching, one seeks all exact occurrences of
the pattern in the text, but the "matching" relation is defined differently. The output is
all locations in the text where the pattern "matches" according to the new definition of a
match. The different applications define the matching relation. Examples can be seen in
Baker's parameterized matching ([?]) or Amir and Farach's less-than matching ([?]). The
second model, and the one we are concerned with in this paper, is that of approximate
matching. In approximate matching, one defines a distance metric between the objects (e.g.
strings, matrices) and seeks to calculate this distance for all text locations. Usually we seek
locations where this distance is small enough.






One of the earliest and most natural metrics is the Hamming distance, where the distance
between two strings is the number of mismatching characters. There exist algorithms for
calculating this distance exactly [?] and for approximating it [?]. Levenshtein [?] identified
three types of errors: mismatches, insertions, and deletions. These operations are traditionally
used to define the edit distance between two strings. The edit distance is the minimum number of
edit operations one needs to perform on the pattern in order to achieve an exact match at
the given text location. Lowrance and Wagner [?, ?] added the swap operation to the set
of operations defining the distance metric. Much of the recent research in string matching
concerns itself with understanding the inherent "hardness" of the various distance metrics,
by seeking upper and lower bounds for string matching under these conditions.

A natural subset of these problems is when the distance is defined only on the alphabet,
and therefore the distance between two strings is the sum of the distances between the
corresponding characters in both strings. It is possible to solve this problem in time $O(nm)$
by employing the naive approach of summing the distances between each character of the
pattern and its corresponding character in the text, for each possible alignment of the
pattern. This problem was first defined by Muthukrishnan in [?] and has been open since.
In this paper we present an approximation algorithm for this problem.

This algorithm consists of two parts. The first part is a preprocessing phase in which random
hash functions on the alphabet are constructed. We use the same hashing that Bartal used for
tree embedding in [?]. We use this hashing in order to separate the places where the distance
between letters is large from the places where this distance is small.

The second part of the algorithm is an application of sampling ([?]), which allows us to
give an approximation of the distance between the text and the pattern in time
$O(\frac{1}{\varepsilon^2} n \cdot \mathrm{polylog}(n, |\Sigma|))$.

The contributions of this paper are twofold. On the technical side, we have solved a problem
that has been open for over a decade, by presenting the fastest known approximation
algorithm for many metrics. Additionally, and this is perhaps the more important contribution
of this paper, we have identified and exploited a new technique, sampling, that has been
used in some recent papers ([?]) only implicitly. We employ sampling in a much more
sophisticated manner and show how to use this important tool for approximating distances.
We also present a novel way of using embeddings and geometry tools in pattern matching.
This technique possesses a wide range of applications. For example, one can easily extend
it to calculate the $\ell_2$-norm distance (in other words, when

$\sqrt{\sum_{j=0}^{m-1} d(t_{i+j}, p_j)^2}$

is the distance measure), or it can be extended to many infinite metrics. Our algorithm
also allows for symbols in the text to be wildcards.

We believe that this new method for solving approximate string matching problems, namely
embedding the metric in some suitable space and sampling, may actually yield efficient algorithms
for many more problems in the future.

2 Problem definition and Preliminaries



Definition 2.1. A metric space is a pair $(X, d)$ where $X$ is a set of points and
$d: X \times X \to \mathbb{R}^+$ is a metric satisfying the following axioms:

$d(x_1, x_2) = d(x_2, x_1)$.

$d(x_1, x_3) \le d(x_1, x_2) + d(x_2, x_3)$. (2)

$d(x_1, x_2) = 0 \iff x_1 = x_2$.

Let $(\Sigma, d)$ be a metric space, and $A = a_0 \ldots a_{m-1}$, $B = b_0 \ldots b_{m-1}$ be any two strings of
the same length with symbols from $\Sigma$. We define the distance of $A$ and $B$ by
$\mathrm{Dist}(A, B) = \sum_{i=0}^{m-1} d(a_i, b_i)$. Given a text $T = t_0 \ldots t_{n-1}$ and a pattern
$P = p_0 \ldots p_{m-1}$, our goal is to calculate the array $S[i] = \mathrm{Dist}(T[i..i+m-1], P)$ for each
possible offset $i = 0, \ldots, n-m$. Calculating the exact values of $S$ can be done in $O(nm)$ time, using the
naive approach. In most cases it is enough to know only an approximation of the distance;
we therefore present an efficient algorithm which approximates the values of $S$.

Convolutions can be used in the standard fashion to improve the time for finite fixed alphabets.

Definition 2.2. Let $A[0], \ldots, A[n-1]$ and $B[0], \ldots, B[m-1]$ be arrays of natural numbers.
The discrete convolution (polynomial multiplication) of $A$ and $B$ is $V$ where:

$V[i] = \sum_{j=0}^{m-1} A[i-j] B[j]$, (3)

where $i = 0, \ldots, n-m$. We denote $V$ as $A * B$. We will choose $A = T$ and $B = P$, i.e. we
will treat $T$ and $P$ as the coefficients of polynomials of degrees $n-1$ and $m-1$, respectively.

By standard tricks, namely (1) reversing the text to obtain $T^R$; (2) calculating $T^R * P$;
(3) reversing the result; and (4) discarding the first $m-1$ values and the last $m-1$ values of the
result (which leaves $n-m+1$ values), we obtain an array $V = (T^R * P)^R[m-1..n-1]$ where for each $i$,

$V[i] = \sum_{j=0}^{m-1} t_{i+j} p_j$. (4)

In other words, for every possible offset $i$, $V[i]$ is the sum of the pattern symbols multiplied,
each with its corresponding text symbol. For convenience, we define
$T \otimes P = (T^R * P)^R[m-1..n-1]$.

A convolution can be computed in time O(nlogm), in a computational model with word 
size O(logm), by using the Fast Fourier Transform (FFT) [?]. 
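
The operation $T \otimes P$ can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses NumPy's real FFT, rounds the result (so it assumes integer inputs), and the name `correlate` is hypothetical rather than taken from the paper.

```python
import numpy as np

def correlate(text, pattern):
    """Return V with V[i] = sum_j text[i+j] * pattern[j] for i = 0..n-m."""
    t = np.asarray(text, dtype=float)
    p = np.asarray(pattern, dtype=float)
    n, m = len(t), len(p)
    size = n + m - 1                      # length of the full linear convolution
    # Steps (1)-(3): reverse the text, convolve via the FFT, reverse the result.
    spectrum = np.fft.rfft(t[::-1], size) * np.fft.rfft(p, size)
    full = np.fft.irfft(spectrum, size)[::-1]
    # Step (4): keep only the n - m + 1 alignments that lie inside the text.
    return np.rint(full[m - 1:n]).astype(np.int64)

print(correlate([1, 2, 3, 4, 5], [1, 0, 2]).tolist())
```

For the text (1, 2, 3, 4, 5) and the pattern (1, 0, 2), offset 0 gives 1·1 + 2·0 + 3·2 = 7, and the remaining offsets follow in the same way.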

Remark 2.3. Using FFT we can compute general pattern distances in time $O(|\Sigma| n \log m)$
by the following method: for every $a \in \Sigma$, we define an array $\chi_a(P)$ by setting
$\chi_a(P)[i] = \chi_a(P[i])$, where $\chi_a(x) = 1$ if $x = a$ and $0$ otherwise. Set $T_a[i] = d(a, t_i)$.
Computing $T_a \otimes \chi_a(P)$ gives us, for every offset, the sum of the distances between the
occurrences of the letter $a$ in the pattern and the corresponding text symbols. The
sum of all convolution results over $a \in \Sigma$ is the desired distance.
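
The following sketch illustrates the method of Remark 2.3 with one correlation per alphabet symbol. The function names, the toy alphabet, and the metric `d` are assumptions made for the example only; the naive $O(nm)$ routine is included purely as a cross-check.

```python
import numpy as np

def correlate(text, pattern):
    """V[i] = sum_j text[i+j] * pattern[j], computed with the FFT."""
    t = np.asarray(text, dtype=float)
    p = np.asarray(pattern, dtype=float)
    n, m = len(t), len(p)
    size = n + m - 1
    spectrum = np.fft.rfft(t[::-1], size) * np.fft.rfft(p, size)
    return np.fft.irfft(spectrum, size)[::-1][m - 1:n]

def exact_distances(text, pattern, alphabet, d):
    """Sum over a of T_a (x) chi_a(P): one correlation per alphabet symbol."""
    total = np.zeros(len(text) - len(pattern) + 1)
    for a in alphabet:
        chi_a = [1.0 if p == a else 0.0 for p in pattern]   # chi_a(P)
        t_a = [float(d(a, t)) for t in text]                # T_a[i] = d(a, t_i)
        total += correlate(t_a, chi_a)
    return total

def naive_distances(text, pattern, d):
    """Direct O(nm) evaluation, used here only to check the FFT version."""
    m = len(pattern)
    return [sum(d(pattern[j], text[i + j]) for j in range(m))
            for i in range(len(text) - m + 1)]

# Hypothetical metric: distance between letters is their gap in the alphabet.
d = lambda x, y: abs(ord(x) - ord(y))
print(np.round(exact_distances("abcab", "ba", "abc", d)))
print(naive_distances("abcab", "ba", d))
```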
We provide some required definitions regarding metric spaces:

Definition 2.4. Let $\rho$ be a mapping $\rho: X \to Y$, where $(X, d_1)$ and $(Y, d_2)$ are metric
spaces. $\rho$ is an isometry if $\forall x, y \in X$, $d_1(x, y) = d_2(\rho(x), \rho(y))$.

Unfortunately, we cannot always construct an isometry. Therefore we consider weaker
conditions:

Definition 2.5. Given two metric spaces $(X, d_1)$ and $(Y, d_2)$, and a value $c \ge 1$, a mapping
$\rho: X \to Y$ is called a $c$-distortion embedding if for all $x, y \in X$,

$d_1(x, y) \le d_2(\rho(x), \rho(y)) \le c \cdot d_1(x, y)$. (5)



3 The One-Mismatch Algorithm 

In this section we describe the one-mismatch algorithm, in itself a very useful general tool
in pattern matching. The one-mismatch algorithm has been described before in [?, ?].

Suppose we are given a numeric text $T$ and a numeric pattern $P$, and we want to find exact
matches of $P$ in $T$. One way to do so is by calculating for each location $0 \le i \le n-m$ the value:

$\sum_{j=0}^{m-1} (p_j - t_{i+j})^2 = \sum_{j=0}^{m-1} p_j^2 - 2 \sum_{j=0}^{m-1} p_j t_{i+j} + \sum_{j=0}^{m-1} t_{i+j}^2$.

Notice this sum will be zero iff there is an exact match at location $i$. Furthermore this sum
can be computed efficiently for all $i$'s in $O(n \log m)$ time using convolutions. Notice that
if $P$ and $T$ are not numeric then an arbitrary one-to-one mapping can be chosen from the
alphabet to the set of positive integers $\mathbb{N}$.

This method can be extended to the case of matching with "don't cares" [?], by simply
calculating instead

$M'[i] = \sum_{j=0}^{m-1} p'_j t'_{i+j} (p_j - t_{i+j})^2$,

where $p'_j = 0$ (resp. $t'_j = 0$) if $p_j$ (resp. $t_j$) is a "don't care" symbol and 1 otherwise.
Wherever there is an exact match with "don't cares" this sum will be exactly 0. This can
also be computed with convolutions in time $O(n \log m)$.

Again, this scheme can be further extended to the one-mismatch problem, which is to
determine if $P$ matches $T[i..i+m-1]$ with at most one mismatch. Furthermore, we can
identify the location of the mismatch for each such $i$. Writing $A_0[i]$ for the sum of squared
differences above, this is done by also computing, for each $i$,

$A_1[i] = \sum_{j=0}^{m-1} j \, p'_j t'_{i+j} (p_j - t_{i+j})^2$

by using the convolution. Then if $P$ matches the text at offset $i$ with one mismatch,
$A_0[i] = (p_r - t_{i+r})^2$ and $A_1[i] = r (p_r - t_{i+r})^2$, where $r$ is the location of the mismatch.
Therefore, by calculating $A_1[i] / A_0[i]$ we find the supposed mismatch location and verify
it. Finally, locations where exact matches occur will be labeled "match", locations where a
single mismatch occurs will be labeled with its location, and locations where more than one
mismatch occurs will be labeled $\perp$. The one-mismatch algorithm is therefore as follows:



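1. Compute the array $A_0[i] = \sum_{j=0}^{m-1} p'_j t'_{i+j} (p_j - t_{i+j})^2$ using FFT.

2. Compute the array $A_1[i] = \sum_{j=0}^{m-1} j \, p'_j t'_{i+j} (p_j - t_{i+j})^2$ using FFT.

3. If $A_0[i] = 0$ set $B[i] \leftarrow$ "match", else set $B[i] \leftarrow A_1[i] / A_0[i]$.

4. For each $i$ s.t. $B[i] \ne$ "match", check to see if $(p[B[i]] - t[B[i] + i])^2 = A_0[i]$. If this is
not the case then set $B[i] \leftarrow \perp$.

The running time of this algorithm is $O(n \log m)$.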
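
The four steps above can be sketched as follows. The sketch makes several assumptions that are not in the paper: symbols are positive integers, the value 0 encodes the "don't care" symbol, `None` stands for $\perp$, and the helper names are hypothetical. It also adds a guard that the reported position is not a wildcard, which the step-4 check needs when don't cares are present.

```python
import numpy as np

def correlate(a, b):
    """V[i] = sum_j a[i+j] * b[j] via the FFT, rounded to integers."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    size = n + m - 1
    spectrum = np.fft.rfft(a[::-1], size) * np.fft.rfft(b, size)
    return np.rint(np.fft.irfft(spectrum, size)[::-1][m - 1:n]).astype(np.int64)

def one_mismatch(text, pattern):
    """Label each offset "match", a mismatch position, or None (more than one)."""
    t = np.asarray(text, dtype=np.int64)
    p = np.asarray(pattern, dtype=np.int64)
    tm = (t != 0).astype(np.int64)        # t'_i: 0 for a don't care, else 1
    pm = (p != 0).astype(np.int64)        # p'_j: 0 for a don't care, else 1
    j = np.arange(len(p))

    def weighted(w):
        # sum_j w_j p'_j t'_{i+j} (p_j - t_{i+j})^2, expanded into three
        # correlations; wildcards are 0, so p' * p = p and t' * t = t.
        return (correlate(tm, w * p * p) - 2 * correlate(t, w * p)
                + correlate(t * t, w * pm))

    a0, a1 = weighted(np.ones_like(p)), weighted(j)    # steps 1 and 2
    labels = []
    for i in range(len(a0)):
        if a0[i] == 0:                                  # step 3
            labels.append("match")
            continue
        r, rem = divmod(int(a1[i]), int(a0[i]))
        ok = (rem == 0 and r < len(p) and pm[r] * tm[i + r] == 1
              and (p[r] - t[i + r]) ** 2 == a0[i])      # step 4
        labels.append(r if ok else None)
    return labels

print(one_mismatch([1, 2, 3, 1, 2, 4, 1, 5, 3], [1, 2, 3]))
```

At offset 3 the window (1, 2, 4) differs from the pattern only at position 2, so $A_0 = 1$, $A_1 = 2$ and the quotient recovers the position.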

4 The Sampling Method 

In this section we present a general method referred to as the sampling method. It allows 
us, for every possible offset, to sample (i.e. choose) a random mismatch from the set of all 
mismatches w.r.t. this offset. We show how to utilize the previously described algorithm 
for this purpose. 

First fix some probability $0 < q < 1$, and define a subpattern $P^*$ of $P$ by:

$P^*_i = \begin{cases} p_i & \text{with probability } q \\ \phi \text{ (don't care)} & \text{otherwise.} \end{cases}$ (6)

In the algorithm referred to as Sample($q$, $T$, $P$), we simply create $P^*$ as defined in (6) and
run the one-mismatch algorithm on $P^*$ and $T$. Now, for every offset $i$, let $m_i$ be the number
of mismatches between $T$ and $P$ w.r.t. this offset. The following lemma trivially follows:

Lemma 4.1. Let $B$ be the array returned by Sample($q$, $T$, $P$). For every location $i$,

$\Pr(B[i] = \text{"match"}) = (1 - q)^{m_i}$

and

$\Pr(B[i] \text{ is a mismatch location}) = m_i q (1 - q)^{m_i - 1}$.

Another important property of this algorithm is that the mismatch returned w.r.t. offset $i$
is uniformly distributed over the set of all mismatches w.r.t. offset $i$. We will show how to
use this algorithm to sample a random location for which the distance is not 0. Notice that
for $q \approx \frac{1}{m_i}$ the probability of finding a mismatch w.r.t. offset $i$ is $\Theta(1)$. Therefore, we can
enumerate on $q \in \{2^{-j}\}_{j=0}^{\log m}$. Then, for every location $i$ there exists some $q$ which is
$\approx \frac{1}{m_i}$. Therefore, the next algorithm finds for each location a mismatch with constant
probability, which is uniformly distributed over all the mismatches.

1. for q = 1; q > 1/m; q = q/2 do

2. run Sample(q, T, P)

3. for every offset i, if a mismatch is found, return it.
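
The loop above can be sketched as follows. This is an illustration only: for brevity it uses `np.correlate` (a direct $O(nm)$ correlation) in place of the FFT, encodes the don't care symbol as 0, and uses hypothetical function names; a faithful implementation would plug in the FFT-based one-mismatch routine.

```python
import numpy as np

def one_mismatch(t, p):
    """Compact one-mismatch routine; the symbol 0 is the don't care."""
    tm, pm, j = (t != 0) * 1, (p != 0) * 1, np.arange(len(p))
    corr = lambda a, b: np.correlate(a, b, mode="valid")  # stand-in for the FFT
    def weighted(w):
        return corr(tm, w * p * p) - 2 * corr(t, w * p) + corr(t * t, w * pm)
    a0, a1 = weighted(np.ones_like(p)), weighted(j)
    labels = []
    for i in range(len(a0)):
        if a0[i] == 0:
            labels.append("match")
            continue
        r, rem = divmod(int(a1[i]), int(a0[i]))
        ok = (rem == 0 and r < len(p) and pm[r] * tm[i + r] == 1
              and (p[r] - t[i + r]) ** 2 == a0[i])
        labels.append(r if ok else None)
    return labels

def sample(q, t, p, rng):
    """Sample(q, T, P): keep each pattern symbol with probability q."""
    keep = rng.random(len(p)) < q
    return one_mismatch(t, np.where(keep, p, 0))

def sample_mismatches(text, pattern, rng):
    """For each offset, return one sampled mismatch position (or None)."""
    t = np.asarray(text, dtype=np.int64)
    p = np.asarray(pattern, dtype=np.int64)
    found = [None] * (len(t) - len(p) + 1)
    q = 1.0
    while q > 1.0 / len(p):
        for i, label in enumerate(sample(q, t, p, rng)):
            if found[i] is None and label is not None and label != "match":
                found[i] = label
        q /= 2
    return found

rng = np.random.default_rng(0)
print(sample_mismatches([1, 2, 3, 1, 2, 4, 1, 5, 3], [1, 2, 3], rng))
```

Every position reported this way is a genuine mismatch of $P$ against $T$, since a kept symbol of $P^*$ equals the corresponding symbol of $P$; offsets with many mismatches are only found with some probability.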



5 Motivation for the Algorithm 



Remark 5.1. In this paper we assume that the ratio of maximal and minimal distances is
bounded by $B_d$. Therefore, w.l.o.g. we can assume that the minimal nonzero distance is 1,
i.e. $\forall x, y \in \Sigma$, $d(x, y) \le B_d$ and $d(x, y) > 0 \iff d(x, y) \ge 1$. That is because if $B_{\min}$ is
the minimal nonzero distance, then we can use the metric $\frac{d}{B_{\min}}$ instead.

A first naive approach to approximate the distance is as follows: say we wish to provide
an approximation only for some offset $i$, and let $X$ be a random variable which is equal to
$d(t_{i+j}, p_j)$, where $j$ is chosen uniformly from $0, 1, \ldots, m-1$. We can sample $X$ by choosing a
random $j$ and calculating $d(t_{i+j}, p_j)$. The expectation of $m \cdot X$ is the desired sum. Therefore,
the way to compute $E(X)$ is to sample $X$ several times and return the average. The problem
with this approach is that the variance of $X$ may be very large: for example, if $P$ matches
$T$ except for a few mismatches, then w.h.p. we will not sample even a single mismatch.

The second attempt, which aims to reduce the variance of $X$, is to use the sampling algorithm
described in Sect. 4. As a result, $X$ will be distributed only over locations where $d(t_{i+j}, p_j) > 0$.
That is because the sampling algorithm returns only relative locations $j$ for which
$t_{i+j} \ne p_j$. This sampling approach reduces the variance of $X$, but it may still happen
that for some offset $i$, all distances are very small except for a single one which is even
greater than the sum of all others. With high probability, only the smaller distances will be
sampled, thus affecting the final outcome. This approach can provide us with an algorithm
which runs in $O(n \cdot B_d \cdot \mathrm{polylog}(n, |\Sigma|))$ time; however, $B_d$ may be very large.

All the above leads us to search for a way to sample only locations $j$ for which, when
some $D$ is fixed, $D \le d(t_{i+j}, p_j) < 2D$. Then, with an additional multiplicative factor of
$\log(B_d)$, we can enumerate on $D \in \{2^k\}_{k=0}^{\log B_d}$, each step approximating the expectation of
the variable $X_D$ which uniformly ranges over $\{d(t_{i+j}, p_j) \mid D \le d(t_{i+j}, p_j) < 2D\}$. Notice
that for any value $a$,

$\Pr(X_D = a) = \begin{cases} \frac{\Pr(X = a)}{\Pr(D \le X < 2D)} & D \le a < 2D \\ 0 & \text{else.} \end{cases}$

Hence it follows that

$E(X) = \sum_D \Pr(D \le X < 2D) \, E(X_D)$.

A hypothetical way to sample would be to design a mapping $\pi_D$ on $\Sigma$ for which

$D \le d(x, y) < 2D \iff \pi_D(x) \ne \pi_D(y)$. (7)

If so, we could have run the sampling algorithm on $\pi_D(T)$ and $\pi_D(P)$, applying $\pi_D$ in
the obvious way, and obtained samples of $X_D$. Then, the average of the sampled distances would
approximate $E(X_D)$. The approximation of $\Pr(D \le X < 2D)$ would also have been simple:
it is the approximated number of mismatches divided by $m$ (i.e. the length of the pattern).
Unfortunately, we cannot design such a mapping. However, we can design a set of mappings
such that for a random mapping, this condition holds with high probability.
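
The decomposition of $E(X)$ into dyadic distance classes can be checked on a small example. The sketch below assumes integer distances at one fixed offset, with $X$ uniform over all $m$ positions and classes of the form $D \le x < 2D$ for $D = 1, 2, 4, \ldots$; the function name and the sample data are hypothetical.

```python
from fractions import Fraction

def bucket_decomposition(dists, max_dist):
    """Evaluate sum over D of Pr(D <= X < 2D) * E(X_D) for D = 1, 2, 4, ..."""
    m = len(dists)
    total = Fraction(0)
    D = 1
    while D <= max_dist:
        bucket = [x for x in dists if D <= x < 2 * D]
        if bucket:
            prob = Fraction(len(bucket), m)             # Pr(D <= X < 2D)
            mean = Fraction(sum(bucket), len(bucket))   # E(X_D)
            total += prob * mean
        D *= 2
    return total

# Distances d(t_{i+j}, p_j) at one offset; zeros are matching positions.
dists = [0, 0, 1, 3, 0, 6, 2, 0]
expectation = bucket_decomposition(dists, max_dist=6)
print(expectation, len(dists) * expectation)
```

Here the classes contribute $\frac{1}{8} \cdot 1 + \frac{2}{8} \cdot \frac{5}{2} + \frac{1}{8} \cdot 6 = \frac{3}{2}$, and multiplying by $m = 8$ recovers the sum of distances, 12.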

6 Probabilistically Separating Hashing 

In this section, our goal is to construct a random hash $\pi$ for a given $D$ such that (7) holds
with good probability. A set of hash functions $\Pi_D$ is called a $C$-Probabilistically Separating
Hashing if it satisfies the following two conditions:

1. If the distance between $x$ and $y$ is at least $D$, then their hashes are different, i.e.
$d(x, y) \ge D \Rightarrow \forall \pi \in \Pi_D, \ \pi(x) \ne \pi(y)$.

2. $\forall x, y \in \Sigma$, $\Pr_{\pi \in \Pi_D}(\pi(x) \ne \pi(y)) \le C \cdot \frac{d(x, y)}{D}$.

Bartal in [?] gave a construction of a $\log |\Sigma|$-Probabilistically Separating Hashing for finite
metrics; afterwards it was extended to graphs embedded in real normed spaces in [?]. In Section 8
we will give a simple construction for the case when the alphabet is the normed space $\mathbb{R}^d$ with small
$d$.

Notice that we need to build such a hashing only once for every alphabet. Therefore
it can be done as a preprocessing step.

We are able to use $\pi_D$ in order to sample a subset of indices for which the distance is not
too small. We will now show that this will also allow us to sample $X_D$ as we desire.

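Lemma 6.1. Fix an offset $i$. Let $A = \{j \mid \pi_D(t_{i+j}) \ne \pi_D(p_j)\}$ be the set of mismatches
under $\pi_D$ and $B = \{j \mid D \le d(t_{i+j}, p_j) < 2D\}$ be the set of indices we are really interested
in sampling from. Then:

1. $E_{\pi_D \in \Pi_D}(|A|) = O\left(\frac{C \cdot S}{D}\right)$ (where the expectation is over the choice of $\pi_D$).

2. $B \subseteq A$ and $\frac{S_D}{2D} \le |B| \le \frac{S_D}{D}$,

where $S = \sum_{j=0}^{m-1} d(p_j, t_{i+j})$ and $S_D = \sum_{j \in B} d(p_j, t_{i+j})$.

Proof. By linearity of expectation:

$E(|A|) = \sum_j \Pr(\pi_D(t_{i+j}) \ne \pi_D(p_j))$.

By the definition of Probabilistically Separating Hashing we have:

$\sum_j \Pr(\pi_D(t_{i+j}) \ne \pi_D(p_j)) \le \sum_j C \cdot \frac{d(t_{i+j}, p_j)}{D} = \frac{C \cdot S}{D}$.

This proves (1).

$B \subseteq A$ follows from the first condition of the definition, and
$\frac{S_D}{D} = \sum_{j \in B} \frac{d(p_j, t_{i+j})}{D}$; but $1 \le \frac{d(p_j, t_{i+j})}{D} < 2$ for $j \in B$,
so $\frac{S_D}{2D} \le |B| \le \frac{S_D}{D}$. □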
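
As a small worked instance of the second part of the lemma (the numbers are hypothetical and chosen only for illustration), take $D = 4$ and suppose the distances of the indices in $B$ are 4, 5 and 7:

```latex
\[
S_D = 4 + 5 + 7 = 16, \qquad
\frac{S_D}{2D} = \frac{16}{8} = 2 \;\le\; |B| = 3 \;\le\; \frac{S_D}{D} = \frac{16}{4} = 4 .
\]
```

Each distance in $B$ lies between $D$ and $2D$, so $|B|$ is determined by $S_D$ up to a factor of 2.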



7 The Algorithm 

At this point, we have all the tools necessary in order to describe the algorithm. The
algorithm is based upon the application of the sampling algorithm, described previously, to
the $C$-probabilistically separating hash provided in Sect. 6. As a preprocessing phase, we
construct for the metric space $(\Sigma, d)$ samples of hashing $\pi_D \in \Pi_D$ for $D = 2^k$. The
preprocessing algorithm therefore gets a metric space $(\Sigma, d)$, where $\Sigma$ is the alphabet and
$d$ is the metric on it, and produces $O(\frac{1}{\varepsilon^2} \log |\Sigma| \log m)$ hash functions $\pi_D$ chosen
at random from $\Pi_D$.

The main (i.e. query) algorithm gets a text $T = t_0 t_1 \ldots t_{n-1}$ and a pattern $P = p_0 \ldots p_{m-1}$
over the alphabet $\Sigma$. For a fixed offset $i$ the result will be in $((1 - \varepsilon)S_i, (1 + \varepsilon)S_i)$ with
probability $1 - e^{-t}$. The output of the algorithm is an array $R[0..n-m]$ where $R[i]$ is
an $\varepsilon$-approximation to $S[i] = \sum_{j=0}^{m-1} d(p_j, t_{i+j})$, i.e.

$\forall i \quad \Pr(|R[i] - S[i]| > \varepsilon S[i]) < e^{-t}$. (8)

We will now outline the idea of the algorithm. We want to approximate

$m E(X) = \sum_D m \Pr(D \le X < 2D) \, E(X_D)$.

We will enumerate $D$, changing it each time by a factor of 2, and approximate
$m \Pr(D \le X < 2D) \, E(X_D)$. Fix some $D$ and some offset $i$, and let as before
$A = \{j \mid \pi_D(t_{i+j}) \ne \pi_D(p_j)\}$, where $A$ depends on the random mapping $\pi_D$, and
$B = \{j \mid D \le d(t_{i+j}, p_j) < 2D\}$. Recall that $B \subseteq A$, and that $\frac{|B|}{|A|}$ is not too small.

In order to approximate $E(X_D)$ we will use the sampling algorithm on $\pi_D(P)$ and $\pi_D(T)$.
We get a random element in $A$, and we check if this element is also in $B$. In order to
approximate $E(X_D)$ we average the distances of the elements found in $B$.

In order to approximate $m \Pr(D \le X < 2D) = |B|$ we use Lemma 4.1. The probability
that the sampling algorithm returns "match" is $q_0 = E((1 - q)^{|A|})$, and the probability that
it returns a mismatch from the set $B$ is $q_1 = |B| \, q \, E((1 - q)^{|A| - 1})$. So, $|B| = \frac{q_1 (1 - q)}{q \, q_0}$.

Let us assume that we run the sampling algorithm $K$ times; then the total number of matches
is $m_0 \approx K q_0$ and the total number of elements in $B$ is $m_1 \approx K q_1$. Let $M_1$ be an array of
the elements in $B$ which were found, including repetitions of elements from $B$, so that $|M_1| = m_1$.

We will approximate $|B|$ by $\frac{m_1 (1 - q)}{q \, m_0}$ because:

$|B| = \frac{E(m_1)(1 - q)}{q \, E(m_0)}$, (9)

and approximate $E(X_D)$ by $\frac{1}{m_1} \sum_{j \in M_1} d(t_{i+j}, p_j)$. Therefore:

$m \Pr(D \le X < 2D) \, E(X_D) \approx \frac{1 - q}{q \, m_0} \sum_{j \in M_1} d(t_{i+j}, p_j)$. (10)
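
The identity behind the estimate of $|B|$ can be verified with exact arithmetic. The sketch below makes a simplifying assumption that is not in the paper: the set $A$ is treated as fixed (so the expectations over the hash disappear), and the counts $m_0$, $m_1$ are replaced by their expected values. All names are hypothetical.

```python
from fractions import Fraction

def expected_counts(K, q, size_A, size_B):
    """Expected number of "match" results and of hits in B over K runs."""
    q0 = (1 - q) ** size_A                       # no mismatch survives in P*
    q1 = size_B * q * (1 - q) ** (size_A - 1)    # exactly one survives, and it is in B
    return K * q0, K * q1

def estimate_B(m0, m1, q):
    """Plug-in estimate of |B| from the two counts."""
    return m1 * (1 - q) / (q * m0)

m0, m1 = expected_counts(K=1000, q=Fraction(1, 8), size_A=10, size_B=4)
print(estimate_B(m0, m1, Fraction(1, 8)))
```

With exact expected counts the estimate returns $|B|$ itself; with real samples, $m_0$ and $m_1$ fluctuate around these values, which is why $q$ has to be chosen so that $m_0$ is not too small.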



We will need to show that this approximation is narrow, i.e. that the variance of the
approximation is small. In order to do so, we will need to choose $q$ s.t. $q \approx \frac{1}{|A|}$. In order
to find such a $q$, we try a series of $q$'s, changing by a factor of 2 each time, and choose $q$
s.t. $m_0$ is large enough and $q \cdot m_0$ is maximal. We prove that this produces a good $q$ w.h.p.

We now write the complete algorithm. Set $K = O(\frac{1}{\varepsilon^2} \cdot C \cdot t)$.



 1  for D = B_d; D ≥ 1; D = D/2 do
 2      for q = 1/2; q > 1/m; q = q/2 do
 3          for iter = 1; iter ≤ K; iter = iter + 1 do
 4              Choose a random π ∈ Π_D
 5              Run Sample(q, π(T), π(P)). Save the result as the iter-th result for this q.
 6          end for
 7      end for
 8      for all offsets i in the text do
 9          Calculate m_0(i, q) for all q's: the number of matches
10          Among all q such that m_0 > e^{-2} · K choose q(i) s.t. q(i) · m_0(i, q(i)) is maximal
11          Set M_1 to be the set of distances between D and 2D for this q
12          Calculate R_D(i) = (1 - q) / (q · m_0) · Σ_{j ∈ M_1} d(t_{i+j}, p_j)
13      end for
14  end for
15  for every offset i return R(i) = Σ_D R_D(i)

Algorithm 1: General distance algorithm



The running time of this algorithm is $O(\frac{1}{\varepsilon^2} n \log^2 m \log |\Sigma| \log B_d)$. This is because the
running time is dominated by the Sample function, which takes $O(n \log m)$ time and
is executed $O(\frac{1}{\varepsilon^2} \log |\Sigma| \log m \log B_d)$ times.

Theorem 7.1. For every offset $i$

$\Pr(|R[i] - S[i]| > \varepsilon S[i]) < e^{-t}$, (11)

or in other words our algorithm returns an $\varepsilon$-approximation w.h.p.

The proof of this theorem appears in the appendix.

8 Explicit hashing constructions for normed spaces 

In this section we construct an explicit $d$-Probabilistically Separating Hashing for the
normed space $\mathbb{R}^d$ with the $\ell_p$ norm, $1 \le p \le \infty$. The main problem with previous
constructions [?] is that this hashing cannot be calculated efficiently: it usually takes $O(n^2)$ time
to calculate one hash function. If the points are not given in advance, this may be the
bottleneck of the algorithm. Another reason why this construction is important is the following: if we have a
$d$-Probabilistically Separating Hashing for a space $X$ and we have an embedding $f: Y \to X$ with
distortion $c$, then we can construct a $cd$-Probabilistically Separating Hashing for the space $Y$. The
problem of embedding metric spaces into real normed spaces has been deeply investigated.

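Our construction is the same for every norm $\ell_p$. Let $\vec{\varepsilon}$ be a vector of $d$ independent random
variables with uniform distribution on $[0, 1]$. Define:

$\pi_D(x) = \left\lfloor \frac{x + \vec{\varepsilon} \cdot D}{D} \right\rfloor$, (12)

where the floor is taken coordinate-wise.

Theorem 8.1. The above mapping $\pi_D$ satisfies the following properties:

1. If the distance between $x$ and $y$ is at least $D \cdot d^{1/p}$, then their mappings are different, i.e.
$\forall x, y \in \mathbb{R}^d$, $\|x - y\|_p \ge D \cdot d^{1/p} \Rightarrow \pi_D(x) \ne \pi_D(y)$.

2. $\forall x, y \in \mathbb{R}^d$, $\Pr(\pi_D(x) \ne \pi_D(y)) \le d \cdot \frac{\|x - y\|_p}{D \cdot d^{1/p}}$.

Proof. As follows:

1. This is trivial: $D \cdot d^{1/p} \le \|x - y\|_p \Rightarrow \exists i \ |x_i - y_i| \ge D$.

2. If $d(x, y) > D$ then this inequality is trivial. Therefore assume $d(x, y) \le D$:

$\Pr(\pi_D(x) \ne \pi_D(y)) \le \sum_{i=1}^{d} \Pr(\pi_D(x)_i \ne \pi_D(y)_i) \le \sum_{i=1}^{d} \frac{|x_i - y_i|}{D} = \frac{\|x - y\|_1}{D} \le d \cdot \frac{\|x - y\|_p}{D \cdot d^{1/p}}$.

□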
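
A randomly shifted grid of this kind can be sketched in Python. The sketch assumes that $\pi_D$ maps each coordinate $x_i$ to $\lfloor x_i / D + \varepsilon_i \rfloor$ with $\varepsilon_i$ uniform on $[0, 1)$; the function name `make_grid_hash` and the sample points are hypothetical.

```python
import math
import random

def make_grid_hash(D, dim, rng):
    """Draw one hash: a grid of cell width D with a random shift per coordinate."""
    shift = [rng.random() for _ in range(dim)]    # the vector of uniform shifts
    def pi(x):
        # Coordinate-wise floor of (x_i + shift_i * D) / D.
        return tuple(math.floor(xi / D + s) for xi, s in zip(x, shift))
    return pi

# Two points whose first coordinates differ by 5 >= D = 4 are always separated.
x, y = (0.0, 0.0), (5.0, 0.1)
separated = 0
for seed in range(20):
    pi = make_grid_hash(4.0, 2, random.Random(seed))
    separated += pi(x) != pi(y)
print(separated)
```

Points that differ by at least $D$ in some coordinate land in different cells for every shift, while nearby points are separated only when a grid boundary happens to fall between them.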

We choose to use $\pi$ as our embedding, and notice that $\pi$ is easy to calculate, assuming we already
have the $c$-distortion embedding. This calculation can be done in $O(d)$ time.

Remark 8.2. Consider the set $\{\pi_D(x) \mid x \in \Sigma\}$. While each of its members is a vector of
length $d$, when comparing these vectors we are only interested in checking equality.
Therefore, in order to save space, we can replace each vector with a unique number in $\{1, \ldots, |\Sigma|\}$.
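
The replacement described in this remark amounts to assigning consecutive identifiers to distinct hash vectors. A minimal sketch, with hypothetical names and a toy deterministic hash standing in for $\pi_D$:

```python
def intern_hashes(symbols, pi):
    """Map every symbol to a number in 1..|alphabet| identifying its hash vector."""
    ids = {}      # hash vector -> identifier
    code = {}     # symbol -> identifier of its hash vector
    for s in symbols:
        key = pi(s)
        if key not in ids:
            ids[key] = len(ids) + 1    # numbering starts at 1
        code[s] = ids[key]
    return code

# Toy hash: cells of width 4 without a random shift, for illustration only.
pi = lambda x: tuple(v // 4 for v in x)
print(intern_hashes([(0, 0), (1, 1), (9, 9)], pi))
```

Numbering from 1 leaves the value 0 free, which is convenient if 0 is used to encode the "don't care" symbol in the numeric one-mismatch computation.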

9 Conclusions 

We have presented the first non-trivial algorithm for the approximation of a large class of
distances between text and pattern. We believe that the techniques we have presented here
have a wide range of applications. A further interesting open question is to generalize these
techniques to the case where the distance is not necessarily a metric.

A The proof of the algorithm 

Remark A.1. Here w.h.p. means with probability more than $1 - e^{-t}$.

We will now prove that the algorithm indeed approximates the distances for each $i$ w.h.p.
We will only sketch the proof.



Proof (of Algorithm). Fix some offset $i$. Then for every $D$ we set $B = \{j \mid D \le d(t_{i+j}, p_j) < 2D\}$
and $A = \{j \mid \pi_D(t_{i+j}) \ne \pi_D(p_j)\}$. Notice that $|A|$ is a random variable.

Claim A.1. W.h.p. for every $D$ there exists $q(D)$ s.t. $m_0(q) > e^{-2} K$ and
$q \cdot m_0 \ge \frac{K}{C \cdot E(|A|)}$ for a universal constant $C$.

Proof. There exists $q$ s.t. $\frac{1}{2E(|A|)} \le q \le \frac{1}{E(|A|)}$. The probability of a match for this $q$ is
$q_0 = E((1 - q)^{|A|})$; by Jensen's inequality $E((1 - q)^{|A|}) \ge (1 - q)^{E(|A|)} \ge e^{-2}$. $m_0(q)$ has
binomial distribution $B(K, q_0)$ and so w.h.p. $m_0 > e^{-2} K$. For this $q$ it also holds that
$q \cdot m_0 \ge \frac{K}{C \cdot E(|A|)}$, so for the $q$ that the algorithm chose it also holds that
$q \cdot m_0 \ge \frac{K}{C \cdot E(|A|)}$. □

Claim A.2. There exists a constant $\bar{m}_0 = E(m_0) = K \cdot E((1 - q)^{|A|})$ such that

$(1 - \varepsilon) \bar{m}_0(D) \le m_0(D) \le (1 + \varepsilon) \bar{m}_0(D)$

w.h.p.

Proof. This follows from the Chernoff bound, because $m_0$ is a binomially distributed
variable $B(K, p)$ with $p > e^{-2}$. □



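Let $c_D = \frac{(1 - q(D)) D}{q(D) \bar{m}_0(D)}$ be a constant and
$S(i) = \sum_D c_D \sum_{j \in M_1(D)} \frac{d(t_{i+j}, p_j)}{D}$.

Claim A.3. $S(i)$ is close to $R(i)$ w.h.p., i.e.

$(1 - \varepsilon) R(i) \le S(i) \le (1 + \varepsilon) R(i)$.

Proof. We can represent $R(i)$ as:

$R(i) = \sum_D \frac{(1 - q(D)) D}{q(D) m_0(D)} \sum_{j \in M_1(D)} \frac{d(t_{i+j}, p_j)}{D}$.

By the previous claim $m_0(D)$ is close to $\bar{m}_0(D)$. □

Claim A.4.

$E(S(i)) = \sum_{j=0}^{m-1} d(p_j, t_{i+j})$.

Proof.

$E(S(i)) = \sum_D \frac{c_D E(|M_1(D)|)}{D} E(X_D)$. (13)

But we know that:

$\frac{c_D E(|M_1(D)|)}{D} = \frac{(1 - q(D)) E(m_1(D))}{q(D) \bar{m}_0(D)} = \frac{(1 - q(D)) E(m_1(D))}{q(D) E(m_0(D))}$. (14)

By (9) we have that

$\frac{c_D E(|M_1(D)|)}{D} = |B| = m \Pr(D \le X < 2D)$.

So we have that:

$E(S(i)) = \sum_D m \Pr(D \le X < 2D) E(X_D) = m E(X) = \sum_{j=0}^{m-1} d(p_j, t_{i+j})$. (15)

□

Claim A.5. There exists a universal constant $C$ s.t. w.h.p.

$c_D \le C \frac{\varepsilon^2}{t} \sum_D c_D |M_1(D)|$.

Proof. The definition of $S(i)$ shows that $S(i) \le \sum_D 2 c_D |M_1(D)|$, because
$\frac{d(t_{i+j}, p_j)}{D} < 2$ for $j \in M_1(D)$. Therefore, it is enough to prove that
$c_D \le C \frac{\varepsilon^2}{t} S(i)$. By Claim A.1 we know that $q \cdot m_0 \ge \frac{K}{C \cdot E(|A|)}$, so we have:

$c_D = \frac{(1 - q(D)) D}{q(D) \bar{m}_0(D)} \le \frac{C \cdot E(|A|) D}{K}$. (16)

By Lemma 6.1, $E(|A|) D = O(S \cdot c \cdot d)$, and $K = O(\varepsilon^{-2} \cdot c \cdot d \cdot t)$. So

$\frac{E(|A|) D}{K} = O\left(\frac{\varepsilon^2 S}{t}\right)$. (17)

□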



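We will state the following lemma without proving it because it follows from the Chernoff
bound:

Lemma A.2. There exists a universal constant $C$ s.t. for every sequence of independent
random variables $X_1, X_2, \ldots, X_n$ with $1 \le X_i \le 2$, and every sequence of positive constants
$c_1, c_2, \ldots, c_n$ s.t. $c_i \le C \frac{\varepsilon^2}{t} \sum_{i=1}^{n} c_i$, it holds that:

$\Pr\left( \left| \sum_{i=1}^{n} c_i X_i - E\left( \sum_{i=1}^{n} c_i X_i \right) \right| > \varepsilon E\left( \sum_{i=1}^{n} c_i X_i \right) \right) < e^{-t}$.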



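Claim A.6.

$\Pr\left( \left| S(i) - \sum_{j=0}^{m-1} d(p_j, t_{i+j}) \right| > \varepsilon S \right) < e^{-t}$, (18)

where $S = \sum_{j=0}^{m-1} d(p_j, t_{i+j})$.

Proof. $S(i) = \sum_k c_k X_k$, where by definition $1 \le X_k \le 2$, and by Claim A.4
$E(S(i)) = \sum_{j=0}^{m-1} d(p_j, t_{i+j})$. By Claim A.5 it follows that
$c_k \le C \frac{\varepsilon^2}{t} \sum_{k} c_k$, and therefore Lemma A.2 proves the claim. □

□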